Project 3 - Exploring Prosper Loan Data by Chenchen Li

This report explores a dataset of prosper loan data. This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information. In this report, I selected 23 variables and created a few new variables through exploratory analysis.

========================================================

## 'data.frame':    113937 obs. of  81 variables:
##  $ ListingKey                         : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
##  $ ListingNumber                      : int  193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
##  $ ListingCreationDate                : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
##  $ CreditGrade                        : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
##  $ Term                               : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus                         : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ ClosedDate                         : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
##  $ BorrowerAPR                        : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate                       : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LenderYield                        : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ EstimatedEffectiveYield            : num  NA 0.0796 NA 0.0849 0.1832 ...
##  $ EstimatedLoss                      : num  NA 0.0249 NA 0.0249 0.0925 ...
##  $ EstimatedReturn                    : num  NA 0.0547 NA 0.06 0.0907 ...
##  $ ProsperRating..numeric.            : int  NA 6 NA 6 3 5 2 4 7 7 ...
##  $ ProsperRating..Alpha.              : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
##  $ ProsperScore                       : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory..numeric.          : int  0 2 0 16 2 1 1 2 7 7 ...
##  $ BorrowerState                      : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ Occupation                         : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ EmploymentStatus                   : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ EmploymentStatusDuration           : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ IsBorrowerHomeowner                : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ CurrentlyInGroup                   : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
##  $ GroupKey                           : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
##  $ DateCreditPulled                   : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
##  $ CreditScoreRangeLower              : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper              : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ FirstRecordedCreditLine            : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
##  $ CurrentCreditLines                 : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ OpenCreditLines                    : int  4 14 NA 5 19 17 7 6 16 16 ...
##  $ TotalCreditLinespast7years         : int  12 29 3 29 49 49 20 10 32 32 ...
##  $ OpenRevolvingAccounts              : int  1 13 0 7 6 13 6 5 12 12 ...
##  $ OpenRevolvingMonthlyPayment        : num  24 389 0 115 220 1410 214 101 219 219 ...
##  $ InquiriesLast6Months               : int  3 3 0 0 1 0 0 3 1 1 ...
##  $ TotalInquiries                     : num  3 5 1 1 9 2 0 16 6 6 ...
##  $ CurrentDelinquencies               : int  2 0 1 4 0 0 0 0 0 0 ...
##  $ AmountDelinquent                   : num  472 0 NA 10056 0 ...
##  $ DelinquenciesLast7Years            : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ PublicRecordsLast10Years           : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ PublicRecordsLast12Months          : int  0 0 NA 0 0 0 0 0 0 0 ...
##  $ RevolvingCreditBalance             : num  0 3989 NA 1444 6193 ...
##  $ BankcardUtilization                : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ AvailableBankcardCredit            : num  1500 10266 NA 30754 695 ...
##  $ TotalTrades                        : num  11 29 NA 26 39 47 16 10 29 29 ...
##  $ TradesNeverDelinquent..percentage. : num  0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
##  $ TradesOpenedLast6Months            : num  0 2 NA 0 2 0 0 0 1 1 ...
##  $ DebtToIncomeRatio                  : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange                        : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ IncomeVerifiable                   : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ StatedMonthlyIncome                : num  3083 6125 2083 2875 9583 ...
##  $ LoanKey                            : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
##  $ TotalProsperLoans                  : int  NA NA NA NA 1 NA NA NA NA NA ...
##  $ TotalProsperPaymentsBilled         : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ OnTimeProsperPayments              : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ ProsperPaymentsLessThanOneMonthLate: int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPaymentsOneMonthPlusLate    : int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPrincipalBorrowed           : num  NA NA NA NA 11000 NA NA NA NA NA ...
##  $ ProsperPrincipalOutstanding        : num  NA NA NA NA 9948 ...
##  $ ScorexChangeAtTimeOfListing        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanCurrentDaysDelinquent          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LoanFirstDefaultedCycleNumber      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanMonthsSinceOrigination         : int  78 0 86 16 6 3 11 10 3 3 ...
##  $ LoanNumber                         : int  19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
##  $ LoanOriginalAmount                 : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate                : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
##  $ LoanOriginationQuarter             : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
##  $ MemberKey                          : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
##  $ MonthlyLoanPayment                 : num  330 319 123 321 564 ...
##  $ LP_CustomerPayments                : num  11396 0 4187 5143 2820 ...
##  $ LP_CustomerPrincipalPayments       : num  9425 0 3001 4091 1563 ...
##  $ LP_InterestandFees                 : num  1971 0 1186 1052 1257 ...
##  $ LP_ServiceFees                     : num  -133.2 0 -24.2 -108 -60.3 ...
##  $ LP_CollectionFees                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_GrossPrincipalLoss              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NetPrincipalLoss                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NonPrincipalRecoverypayments    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PercentFunded                      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Recommendations                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsCount         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsAmount        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Investors                          : int  258 1 41 158 20 1 1 1 1 1 ...
## 'data.frame':    113937 obs. of  25 variables:
##  $ Term                     : Factor w/ 3 levels "12","36","60": 2 2 2 2 2 3 2 2 2 2 ...
##  $ LoanStatus               : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ BorrowerAPR              : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate             : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ ProsperRating..Alpha.    : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
##  $ ListingCategory..numeric.: int  0 2 0 16 2 1 1 2 7 7 ...
##  $ Occupation               : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ EmploymentStatus         : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ EmploymentStatusDuration : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ IsBorrowerHomeowner      : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ CreditScoreRangeLower    : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper    : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ CurrentCreditLines       : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ RevolvingCreditBalance   : num  0 3989 NA 1444 6193 ...
##  $ BankcardUtilization      : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ DebtToIncomeRatio        : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange              : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ IncomeVerifiable         : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ StatedMonthlyIncome      : num  3083 6125 2083 2875 9583 ...
##  $ LoanNumber               : int  19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
##  $ LoanOriginalAmount       : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate      : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
##  $ MonthlyLoanPayment       : num  330 319 123 321 564 ...
##  $ ProsperRating            : Factor w/ 7 levels "AA","A","B","C",..: NA 2 NA 2 5 3 6 4 1 1 ...
##  $ ListingCategory          : Factor w/ 21 levels "0 - Not Available",..: 1 3 1 17 3 2 2 3 8 8 ...

Univariate Plots Section

1 Loan amount

Several sharp line on whole number, such as 10000, 15000 and 20000. It is reasonable because people tend to borrow whole numbers of amount. Next, let’s take a look at the statistic of loan amount.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000

The minimum loan is 1000, with the median of 6500 and mean of 8337, the maximum loan requested is 35000.

The most common loan amount.

## [1] 4000

Interesting to note that 4000 is the most common amount people borrowed, followed by 10000 and 15000.

2 Loan Payments

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   131.6   217.7   272.5   371.6  2251.5

Monthly loan payments exhibits positive skew. The minimum monthly payment is 0, with the median of 217.7 and mean of 272.5, the maximum monthly payment is 2251.5.

3 State Monthly Income

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3200    4667    5608    6825 1750003

Monthly income exhibits positive skew. The minimum value is 0, with the median of 4667 and mean of 5608, the maximum monthly payment is 1750003.

4 Debt to Income Ratio

A debt income ratio is the percentage of a consumer’s monthly gross income that goes toward paying debts.

Removing the upper quantile on the data we got:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.140   0.220   0.276   0.320  10.010    8554

DI ratio exhibits slight positive skew. The maximum is 10.01. The data is capped at 10.01, debt-to-income ratio larger then 1000% will be returned as 1001%. The minimum value is 0. With the median of 0.22 and mean of 0.276.

5 Credit Scores

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0   660.0   680.0   685.6   720.0   880.0     591

Credit ranges are normally distributed with mean/med = 685/680. Note: Only CreditScoreRangeLower is used. CreditScoreRangeHigher has very similar distribution but is 20 points higher.

6 Estimated monthly debt

I created a new variable TotalMonthlyDebt, to estimate the monthly debt at inception of loan. (TotalMonthlyDebt = DebtToIncomeRatio * StatedMonthlyIncome)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##      0.0    637.5   1061.5   1224.9   1622.4 171004.2     8554

The resulting distribution is positively skewed.

7 Employment Status

##                    Employed     Full-time Not available  Not employed 
##          2255         67322         26355          5347           835 
##         Other     Part-time       Retired Self-employed 
##          3806          1088           795          6134

There are 67322 employed borrowers or about 59% of the total borrowers.

8 Employment Status Duration

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   26.00   67.00   96.07  137.00  755.00    7625

Employment duration exhibits positive skew with mean 96 and median 67. Most people are employed shorter than 1 year.

9 Loan Category

## # A tibble: 21 x 3
##    ListingCategory..numeric. DollarTotal CountTotal
##                        <int>       <int>      <int>
##  1                         1   577736197      58308
##  2                         0   106096621      16965
##  3                         3    64175191       7189
##  4                         7    62035775      10494
##  5                         2    60148466       7433
##  6                         6    12861665       2572
##  7                         4    10913226       2395
##  8                        13    10549832       1996
##  9                        15     9929936       1522
## 10                        14     7684187        876
## # ... with 11 more rows

Most people borrow to consolidate their debts, in total there are 58308 case or about 51.17%. Interestingly it seems there are a lot of loans not classified. More on this in the multivarate section.

10 Yearly total loan amount

## # A tibble: 10 x 3
##    LoanYear DollarTotal CountTotal
##    <fct>          <int>      <int>
##  1 2014       144995536      12172
##  2 2013       362170278      34345
##  3 2012       153175116      19553
##  4 2011        75138013      11228
##  5 2010        26940486       5652
##  6 2009         8914396       2047
##  7 2008        69561850      11552
##  8 2007        80787786      11460
##  9 2006        28132199       5906
## 10 2005           78687         22

Here the loan origination dates are decomposed using lubridate and plotted by month and year. It is very interesting that the loans by year plot and table shows a drop off in 2009.

11 Revolving Credit Balance

Revolving Credit Balance is the total outstanding balance that the borrower owes on open credit cards or other revolving credit accounts.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0    3121    8549   17599   19521 1435667    7604

The median is 8549 and mean 17600. The maximum value is 1436000. The minimum and the most common amount is 0.

12 Bankcard Utilization

Bankcard utilization is the sum of the balances owed on open bankcards divided by the sum of the card’s credit limits. Lower usually means better.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.310   0.600   0.561   0.840   5.950    7604

There is a peak in the plot near 100% which means a lot of borrowers who have almost 100% Bankcard utilization. There are some borrowers who have utilization > 1.00 (100%).

Number of borrowers with BankcardUtilization < 0.05:

## [1] 9361

Number of borrowers near 1:

## [1] 9532

Number of borrowers with BankcardUtilization >= 1:

## [1] 2574

There are 2574 borrowers who has bankcard utilization > 1. That means they owed more then the credit limit.

13 Loan Status

## 
##              Cancelled             Chargedoff              Completed 
##                      5                  11992                  38074 
##                Current              Defaulted FinalPaymentInProgress 
##                  56576                   5018                    205 
##   Past Due (>120 days)   Past Due (1-15 days)  Past Due (16-30 days) 
##                     16                    806                    265 
##  Past Due (31-60 days)  Past Due (61-90 days) Past Due (91-120 days) 
##                    363                    313                    304

Most loans in the whole data set are completed and current. Later I will break it down by year.

14 Loans by Term

## [1] "# of Loans by Term (months)"
##    12    36    60 
##  1614 87778 24545

Most people have 3 year loans.

15 Loan counts vs year by term

Most people have 3 years loan. 5 years loan starts to show up from 2011, and the count of 5 years loan continue to increase until 2014

16 Loan status broken down by year

Most defaults occurred in 2006-2008.

17 Past Due Loans

The loans that were labelled ‘Past Due’ peaked and plateaued from 2012 to 2013.

18 Employment & Homeowner

## [1] "Is the borrower a homeowner? "
## False  True 
## 56459 57478

Borrowers are mostly employed and about half are homeowners.

Univariate Analysis

What is the structure of your dataset?

There are 113,937 loans in the dataset, 23 main features out of 81 were selected in this report and listed below:

  • Term
  • LoanStatus
  • BorrowerAPR
  • BorrowerRate
  • ListingCategory..numeric.
  • Occupation
  • EmploymentStatus
  • EmploymentStatusDuration
  • IsBorrowerHomeowner
  • CreditScoreRangeLower
  • CreditScoreRangeUpper
  • CurrentCreditLines
  • RevolvingCreditBalance
  • BankcardUtilization
  • DebtToIncomeRatio
  • ProsperRating
  • IncomeRange
  • IncomeVerifiable
  • StatedMonthlyIncome
  • LoanNumber
  • LoanOriginalAmount
  • LoanOriginationDate
  • MonthlyLoanPayment

What is/are the main feature(s) of interest in your dataset?

The main features for the loans are loan amount, term and borrowerRate. The main features for borrowers are credit score, debt to income ratio, monthlyLoanPayment and income.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Loan status and category will help to support a detailed picture that which type of loans are current, defaulted, delinquent etc.

Did you create any new variables from existing variables in the dataset?

I created a new variable TotalMonthlyDebt, to estimate the monthly debt at inception of loan, by multiplying the DI ratio by the stated monthly income.

To be better investigate the dataset with yearly and seasonality analysis. I broke out the loan origination date in LoanYear, LoanMonth using lubridate.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

All of the money related variables (LoanOriginalAmoun, RevolvingCreditBalance and monthlyPayment etc.) are positively skewed. I do not transform the data for univariate analysis.

Bivariate Plots Section

1 LoanOriginalAmount VS ListingCategory

First, let’s have a look at the relationship between LoanOriginalAmount with ListingCategory.

The amount of Personal loan and student use are the lowerst two. The baby and adoption loans looks similar to debt consolodation. Interesting to note that wedding loans is quite high. Let’s take a look at the numbers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2000    4000    7500    8836   13000   35000

The median and mean are 7500 and 8836, with maximum value up to 35000.

2 loan amount VS Employment status

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    2500    4000    4873    6000   25000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    9000    9794   15000   35000

It is not surprised that employed borrowers are requesting loan higher then not employed borrowers The median and the mean of not employed borrower are 4000 and 4873 vs 9000 and 9794 from employed borrower.

3 Loan amount VS Income

Most of the Loan are below 10000 and monthly income is under 100000. The quantile shows that the higher the income the higher the median of the loan original amount.

The number of the data that have original amount < 10000 and annual income less then 100000 is:

## [1] 65553

which is around 57.53% of the data.

Interestingly people who borrow > 25000 has annual income of > 100000, it looks like there is some kind of rule, that if you borrow > 25000 the the minimal annual income is 100000.

Let’s verify this a bit.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  100000  115000  137250  160179  175000  800000

And yes the minimum annual income is 100000 when the loanOrigianlAmount > 25000. Next, let’s check the correlation between the 2 variables.

## [1] "Correlation between Annual income and loan amount:"
## [1] 0.2012595
## [1] "Correlation between Annual income < 100000 and loan amount < 10000:"
## [1] 0.1982141

This indicate weak positive relationship between this two variables

4 Loan amount VS credit score

## <ScaleContinuousPosition>
##  Range:  
##  Limits:    0 --  820
## [1] "Relation between loan amount with credit score: "
## [1] 0.3408745

Slight positive correlation. The higher credit score, the larger the loan the borrower can get.

5 BorrowerRate VS DebtToIncomeRatio

That is not too informative. Let’s check the correlation between these 2 variables.

## [1] 0.06291678

The correlation is not significant, it show no or neglible relationship.

6 BorrowerRate VS CreditScoreRangeLower

The plots shows the higher your credit score the lower is your borrower rate.

Let’s take a look at the correlation.

## [1] -0.4615667

This indicate moderate negative relationship between this two variables

7 BorrowRate VS other variables

Let’s see the relationship of the borrower rate with other variables.

As BankcardUtilization, DeliquenciesLast7Years and PublicRecordsLast10Years increases so is the borrower rate. On the other hand the lower the RevolvingCreditBalance the lower the BorrowerRate.

Let’s check the correlation.

Correlation between BorrowerRate with BankcardUtilization:

## [1] 0.255482

Correlation between BorrowerRate with RevolvingCreditBalance:

## [1] -0.05960823

Correlation between BorrowerRate with AnnualIncome:

## [1] -0.0889818

BankcardUtilization has a weak positive relationship. The other factor has a negligible relationship.

8 CurrentCreditLines VS TotalMonthly Debt

## [1] "Correlation between BorrowerRate with AnnualIncome: "
## [1] 0.4733034

This shows a moderate positive correlation. You have more debt as you increase the # of credit lines or vice-versa.

9 Loan amount VS BorrowerRate

## [1] "Correlation between loanOriginalAmount with borrowerRate: "
## [1] -0.3289599

Moderate negative correlation between loan amount and borrower rate. Seems counterintuitive, lower interest rate on larger loans.

10 Monthly income VS TotalMonthlyDebt

## [1] "Correlation between StateMonthlyIncome with TotalMonthlyDebt: "
## [1] 0.3604392

Monthly income and total monthly debt has a positive relationship of R^2 of .36.

11 Monthly income VS Total monthly debt with facet wrapped by IsBorrowerHomeowner.

The homeowners have a higher dispersion whereas the non-homeowners are more concentrated under $5000 income.

12 A closer look at the spike in BorrowerRate

Taking a closer look at spike in the histogram of BorrowerRate, these high interest loans were made mainly for debt consolidation in 2007-2010.

## Warning: Removed 100728 rows containing non-finite values (stat_bin).

The credit score ranges are shown, the higher the borrower rate (cost), the worse the credit rating.

13 Loan amount VS loan year

Mean loans amounts drops in 2009 and increased in 2013.

14 Seasonality of Loan amount

There is a increased average loan amounts and variance during the holiday seasons and a pull back during the spring and summer months.

15 Loan amount VS Term

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    2000    3500    4694    5000   25000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    3000    5000    7276   10000   35000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2000    8000   11500   12370   15000   35000

And entry with term 60 has median of 11500, 5000 for 36 and 3500 for 12.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

In this part, I wanted to see how several features affect the borrower rate and loan amount, and the relationships between some other features.

For instance:

-The borrower rate increases as debt to income ratio increases. Moreover, The higher credit score, the lower is your borrower rate.

-The term seems to be related with loan original amount, the bigger the amount the longer the term.

-The higher credit score, the larger the loan the borrower can get.

  • There’s a negative relationship between interest rates and loan amount, the larger the loan, the lower the rate on average. That was mostly due to them having higher credit scores.

  • People who are employed on average take out larger loans than unemployed people.

  • Average loan amounts and variance are seasonal (higher during holidays)

  • There’s a strong positive relationship between open credit lines and total monthly debt

  • Monthly income and monthly debt have a positive relationship.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

No.

What was the strongest relationship you found?

The 2 strongest relationship I found was between credit score and borrower rate, and between open credit lines and total monthly debt. This makes sense since higher credit scores represent better credit decisions and can make creditors more confident that you will repay your future debts as agreed. Credit scores should be directly related to the cost of borrowing (borrower rate).

Multivariate Plots Section

1 Borrower rate VS loan amount, colored by credit score

The plot above shows the distribution of borrower rate with credit score and loan amount. We see that borrowers with higher credit score tend to have lower borrower rate and higher loan amount.only credit scores from 660 (1st quartile) and above are selected for better visual presentation.

2 Payment to Income Ratio VS years, seperated by IncomeRange

Payment to income ratio was calculated by dividing MonthlyLoanPayment by StatedMonthlyIncome This plot shows what the payment to income ratio would be by year and seperated by IncomeRange. It seems most of the borrowers with lower income ranges take on larger loans. Most of the borrowers seem to have payment/income ratio close to 0.025-0.075%. The $100k+ income range have noticeably lower payment/income ratio. Moreover, all income ranges have an increased payment/income ratio from 2009 to 2014.

3 DI Ratio VS years, seperated by IncomeRange

Comparing DI ratio, most of the borrowers seem to have DI ratio close to 20-30%, which are much lower than the payment/income ratio (2.5 -7.5%). This result indiate most of the borrowers didn’t pay off the debt everymonth. The $100k + income range have noticeably lower DI ratio at around 15-20%.

4 DI Ratio VS years, seperated by IncomeRange

It looks like the uptick in loans amount in 2009-2013 grew primarily in income range of $50-75k, $75-100k and $100k+.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

When looking at the loan amounts vs borrower rate, the credit scores demarcated borrowers by credit worthiness.

Were there any interesting or surprising interactions between features?

I looked at payment/income ratio and DI ratio and observed the higher the income, the lower the percentage of debt. Additionally, it appears the platform after a few years of maturity is beginning to lend to slight more indebted borrowers in the income ranges of $50-75k, $75-100k and $100k+. Looking at the differentce between payment/income ratio and DI ratio, it seemed like most of the borrowers didn’t pay off their debt each month.


Final Plots and Summary

Plot One

Description One

loan categories were bucketed by year and clearly it looks like pre-2010 they did not capture the loan category and debt consolidation dominated the cateogory every year subsequently.

Plot Two

Description Two

This graph shows the distribution of loan amounts by year seperated by credit score. It shows a substantial rebound in loan amounts from 2010-2013. Borrowers with larger loan amount tend to with higher credit scores. I subsetted only credit scores from 660 (1st quartile) and above for better visual presentation

Plot Three

Description Three

The borrowers with high credit scores are in red on the left. They generally have lower interest rates and larger loan amounts. I subsetted only credit scores from 660 (1st quartile) and above for better visual presentation.

Reflection

The Prosper data has a lot of variables, for this project I limited the number of variables to investigate. The difficulties I had with the data mainly from understanding the variables and then selecting the appropriate ones to analyze. After much thought, I use the variables about the finatial information and identity of borrowers and loans. I do this because I assume these are the metric that is important for lender to look at before actually lending money.

This report shows Prosper struggled initially with their first few years of loans (2006-2009). However, their business grew very quickly by the SEC in 2009. Their main loan categories are in debt consolidation and I believe the low interest rate environment has helped the supply of lenders looking for yield. This was a great learning experience about the P2P lending model.

There are a number of different ways to take this project further. Firstly, I’ve focused on a small subset of the variables available in the dataset to explore the investors side a bit more; look at investor profit and losses and their general activity in the peer-to-peer lending industry. Also, I’d like to incorporate some machine learning algorithm such as logistic regression and cluster into my exploration to predict the number of ongoing loans that will end up being defaulted. I believe that an accurate model will be very beneficial to Prosper.